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Abstract 

The simulation of lattice QCD on massively parallel computers stimulated the 
development of scalable algorithms for the solution of sparse linear systems. 
We tackle the problem of the Wilson-Dirac operator inversion by combining 
a Schwarz alternating procedure (SAP) in multiplicative form with a flexible 
variant of the GMRES-DR algorithm. We show that restarted GMRES is not 
able to converge when the system is poorly conditioned. By adding deflation 
in the form of the FGMRES-DR algorithm, an important fraction of the infor- 
mation produced by the iterates is kept between successive restarts leading to 
convergence in cases in which FGMRES stagnates. 



1. Introduction 

The simulation of Lattice QCD is among one of the most challenging prob- 
lems in computational science because of its enormous computational cost, see 
PQ, e.g.. With sufficiently powerful computers becoming available during the 
last fifteen years, the simulation of Lattice QCD including dynamical fcrmions 
became a reality. The de-facto standard algorithm used to accomplish this task 
is the Hybrid Monte Carlo (HMC) algorithm [2J. HMC introduces a fictious 
time, in which a dynamical system is evolved according to the system Hamilto- 
nian. In order to integrate the Hamiltonian equations of motion, the forces due 
to both, gauge field and pseudo-fermion field, have to be evaluated accurately. 
The by far most demanding numerical task is the inversion of the fermionic 
Dirac matrix, needed to evaluate the fermion force. In this paper we investi- 
gate a class of inverters for the fermionic Dirac matrix which is particularly well 
adapted to current high performance hardware architecture. Our work was pri- 
marily inspired by the architecture of the QPACE machine, see Oil], but it is 
of a general nature and applies to all architectures were data movements rather 
than arithmetic operations tend to be the limiting factor on performance. 
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The development of machine- aware algorithms has become particularly im- 
portant during the last years as computer architectures move towards dramati- 
cally increased floating point (FP) speed and increased on-chip parallelism. For 
reasons of cost and because of physical limitations, these improvements are not 
accompanied by comparable progresses in the bandwidths and latencies of main 
memory and processor interconnects. It is predicted that this situation will be 
even aggravated as we move towards the Exascale area, see [5], e.g.. 

An often used rough indicator of balance between processor and memory 
subsystems is the ratio between peak main memory bandwidth and peak floating 
point performance. In the past few years this value went down from rs 1 to 
« 1/8 and less for the CELL processor, modern CPUs and general purpose 
x86 systems. This increasing gap is the main reason for a line of computer 
science research on software techniques aimed at maximizing cache utilization. 
A famous example is [6]. 

The algorithm cost evaluation must take into account not only the FP op- 
erations (as it was traditionally done some years ago when FP was the most 
relevant bottleneck in numerical applications) , but has to carefully evaluate the 
amount and pattern of data movement among the levels of memory hierarchy 
and the network as these are now the most limiting factors. 

Algorithms that are more suitable for these increasingly unbalanced com- 
puter architectures are particularly attractive as they can give easier access to 
the FP performances offered today. 

We describe an algorithm for the Dirac- Wilson matrix inversion that is able 
to exploit the architectural features of modern machines. We show the conver- 
gence behaviour and demonstrate the performance with our implementation on 
the peculiar SFB-TR55 QPACE architecture based on the IBM PowerXCell8i 
processor with state of the art lattices. 

2. Domain decomposition and SAP 

The Wilson fermion matrix D w [7] describes a (periodic) nearest neighbor 
coupling on the 4-dimensional space-time lattice. Without further explanation, 
our desire is to solve the linear system 

D w x = b. 

Owed to the complexity and size of D w , a direct solution is not feasible and 
we therefore exploit the structure of D w . We decompose the overall lattice 
into sublattices, called domains. For each domain, we obtain a local Wilson 
fermion operator by simple restriction; at the local boundary of a domain, we 
remove the couplings with neighboring domains. To describe this formally, let 
n = (ni, n2,ri3 y rii) stand for a 4-tupel describing a lattice site and let L = {n : 
(0, 0, 0, 0) < n < (di — 1, d-x — 1, d^ — 1, d^ — 1)} denote the overall lattice. The 
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Dirac Wilson operator can then be written as 



[D w ] nm = 5 n , m -K ^X^ 1 + 7j) f7 n,A+3,m + C 1 - liWl-l/n-lmj > n > m e X > 

(1) 

where n± j is to be taken mod cL . 

A domain 2/ is characterized by the bounds , dj for its lattice sites which 
are given by L l = {n : (a[ , a l 2l a\, a\) < n < (d[ — 1, d\ — 1, d$ — l,d\ — 1)}. The 
local Dirac operator for domain L is thus given as 

[D% m = S n>m -K f 53 '(1 + TjO^n A+i,m + (! - / , m G ^ 

(2) 

where indicates that the sum is to be taken only over those contributions 
for which n + j ot n — j is in L e . 

Assuming that the local Wilson operators are easy to invert, we can perform 
the following basic domain decomposition iteration, known as the multiplicative 
Schwarz method in the domain decomposition literature [SI |S] ■ 



Algorithm 1: Multiplicative Schwarz method 

choose intitial guess x 
repeat 

for I — 1, . . . ,p do 

compute residual r for domain I: r l = (b — D^x) 1 
solve D l w y l — r 

update: replace part x e in x by y e 
until until convergence 



Herein, the superscript £ indicates that we only take the part which cor- 
responds to the subdomain £. Due to the nearest neighbor coupling of D w , 
computing r is a local operation which involves only those components of x 
which belong to subdomain L e , i.e. x e and those from neighboring subdomains 
which are at the boundary to domain L e . 

If we use a red-black coloring of the subdomains as depicted in Fig. [T] for a 
two-dimensional lattice, and we have an even number of sub-domains in each 
dimension of the lattice, then there is no coupling between subdomains of the 
same color. When we order all red (say) domains first, we see that in the 
algorithm above the values of the residual r e does not depend on whether or 
not we updated the part x J of the iterate for the other subdomains of the same 
color. If we proceed by colors, we end up with the following variant of the 
basic domain decomposition iteration which is termed the Schwarz alternating 
procedure (SAP) in [TU] . 
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Figure 1: domain decomposition in 2 dimensions 



Algorithm 2: SAP iteration 


choose intitial guess x 


repeat 




for all red domains i do 






compute residual r l for domain I: r l — (b — D w x) 1 






solve D^y 1 = r e 




update: replace part x e in x by y l for all red domains £ 




for all black domains I do 






compute residual r l for domain £: r l = (b — D w x) 1 






solve D^y 1 = r l 




update: replace part x in x by y for all black domains I 


until until convergence 
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3. SAP in computational practice 

In computational practice, the SAP method can rarely be used directly in 
the form given by Algorithm [2] On one hand it is usually too costly to solve the 
local systems D^y e = r e exactly. One rather computes an approximation to y 
using another, inner iterative method on the systems D^y £ = r e . Usually, the 
fastest overall method arises if one requires a fairly low accuracy in the inner 
iteration. For example, one requires that the approximate solution y reduces 
the initial residual by just a factor of 10, i.e. one requires 

\\r e -D e w y e \\ <0.1-||/||. 

Such a method is termed an inexact SAP method. 

On the other hand, even the exact SAP iteration may diverge for the Wilson 
Dirac operator when the hopping parameter k approaches the critical value 
k c . Indeed, most of the convergence theory for multiplicative Schwarz methods 
assumes that the operator is hermitian and positive definite, see [H H] , which is 
not the case for the Wilson Dirac operator. 

The usage of SAP methods is particularly suitable on parallel machines and 
in general on machines with a memory hierarchy. Provided that at least two 
subdomains are assigned to a machine node, there is no need for network com- 
munication during the solution of the Wilson equation on the subdomains, since 
all the necessary data is on the node. Data only needs to be communicated for 
the update of the residual. In practice, there are more than two subdomains 
per node giving the possibility to overlap network communication and com- 
putation. The required network bandwidth, and in particular latency, is less 
stringent than for a simple Krylov solver, since the approximate solution of the 
subdomain Wilson equation requires some iterations. The required machine 
performance on global network operations (i.e. global sum, barrier, etc.) is also 
typically reduced, since the frequency of such operations is lower than in the 
case of a Krylov solver. We have chosen SAP as a preconditioner to be used on 
the most time-critical solves on our machine QPACE. The QPACE torus net- 
work [1] is designed with a typical Krylov solver in mind and has therefore high 
bandwidth and low latency, nevertheless SAP is particularly appealing due to 
the peculiar memory hierarchy architecture of the IBM PowerXCell 8i Processor. 
On this processor, the fast on-chip memory is not managed automatically by the 
hardware as on standard processors (cache), but its control is completely left to 
the programmer. If the size of the subdomain is chosen such that it fits the fast 
memory, no main memory access is necessary for the subdomain solve, giving 
the possibility to achieve impressive sustained performance for the subdomain 
solves (up to 50% of the peak) [IT] . [12] . This feature of the SAP algorithm is 
interesting also for future architectures as the main memory performance will 
continue to improve at a lower rate than processor performance. 
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4. Domain decomposition preconditioner 



Exact or inexact SAP iterations can be used as a preconditioner in a Krylov 
subspace method like GMRES P3] or GCR [T3]. This will usually result in an 
iterative process which converges faster than the Krylov subspace method with- 
out preconditioning or the stand-alone SAP iteration. We will discuss GMRES 
and GCR in quite some detail later in section [5j At this point, we concentrate 
on aspects pertaining to the use of inexact SAP iterations in the context of 
preconditioning. 

Loosely speaking, the speed of convergence of a Krylov subspace method for 
a generic system 

Ax = b 

gets faster the closer A is to the identity. So, to accelerate the convergence one 
tries to construct (or implicitly use) a matrix M, the preconditioner, such that 
the matrix AM of the (right) preconditioned system 

(AM)y = b, x = My 

is close to the identity. In this sense, M should be an approximation to the 
inverse of A. Such M is given implicitly if we perform some steps of an iterative 
procedure like SAP. 

Each step of the Krylov subspace method requires one multiplication with 
A and one with M . If we perform exact SAP with a fixed number, s say, of 
iterative steps, the preconditioner M can be expressed as the truncated series 

3=0 

where the matrices Al and Ajj result from a splitting A = Al—Ajj of the matrix 
A. To be specific, for the SAP iteration from Algorithm [2j where A = D w , this 
splitting is based on the red-black permuted matrix D w given as 



D bb 
D rb 



D br 
D rr 



Herein, the blocks D 1 ^ and D r ^ are made up of all the local Dirac operators 
of the black and the red subdomains, respectively and the off-diagonal blocks 
contain the couplings between subdomains of different color. Then 



A, 



D bb 

w 
D rb 





jjrr 



A, 



-D br 




In inexact SAP we invert A^ only approximately, using some inner iteration 
on the diagonal blocks D^. If this inner iteration is non-stationary, as it is 
the case if we perform some steps of GMRES or MR [T5] , the implicitly given 
preconditioner changes at each iteration. This is a fact one has to account for 
when formulating the preconditioned Krylov subspace methods. In the linear 
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algebra literature such methods are termed flexible, see [THl [H] since they adapt 
themselves to situations where the prcconditioncr varies from one iteration to 
the next. 

The next chapter will first introduce the flexible restarted GMRES method 
before turning to the crucial subject of this paper, namely its acceleration via 
deflation. 

5. F-GMRES-DR 

To simplify the presentation we will use a generic notation in this chapter, 
i.e. the linear system to solve will be denoted 

Ax = b. 

A preconditioner will be denoted M . In a variable preconditioning context, 
where the preconditioner will depend on the current iterative step j, we denote 
the preconditioner by Mj . Note that Mj will usually not be given explicitly but 
rather be the result of some steps of a non-stationary iteration like inexact SAP. 

Particular aspects for (inexact) SAP preconditioning of the Wilson Dirac 
operator will be discussed later. 

5.1. GMRES and GCR 

We consider a generic non-singular linear system 

Ax = b with solution x* = A~ 1 b. 

Given an initial guess xq and its residual tq = b— Axq, the j-th Krylov subspace 
Kj(A,ro) is defined as 

Kj(A,r Q ) = span{r , Ar , . . . , A J_1 ro}. 

For ease of notation, we will often write Kj instead of Kj(A, r ). It is known that 
the solution = A~ x b is contained in the space xq + Kj m where denotes the 
smallest j for which Kj = Kj+i- Krylov subspace methods work by iteratively 
choosing approximations Xj from xq + Kj to approximate sc*. The GMRES and 
GCR method are Krylov subspace methods which both obtain the "optimal" 
Xj in the sense that the residual Vj = b — Axj satisfies the minimal residual 
condition 

INI = mi n llk-^ll- 

xGxo+Kj 

Note that for any subspace V of C n we have 

x = x + Sx minimizes ||6 — Ax\\ over xq + V ^ r — Ax _L AV, (3) 

where rg — Ax — b — A(xq + Sx) is the residual of x = xq + 5x. The equivalence 
([3| holds because the minimizer Sx is such that ASx represents the orthogonal 
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projection of b — A(xq + Sx) onto AV. So the minimal residual condition for 
GMRES or GCR is equivalent to 

r 3 LAK r (4) 

GMRES works by building a sequence of orthonormal vectors v±, . . . , Vj which 
span Kj(A, ro). The coefficients rji in Xj = xq + Ylj—i rjiVi are then obtained by 
solving a small (j + 1) X j least squares problem. GCR computes a sequence 
of orthonormal vectors w% , . . . , Wj which span AKj , and an auxiliary sequence 
Si, . . . , Sj with Wj = Asj. The iterate Xj is obtained by an update Xj = + 
ctjSj-i. Both, GMRES and GCR, suffer from the fact that the number of vectors 
stored and the cost for the orthogonalizations increase linearly with n. A remedy 
is to restart or truncate the methods. GCR requires twice as many vectors 
to store as GMRES, and its total arithmetic costs are slightly higher than in 
GMRES. If convergence is slow, GMRES is significantly more stable numerically 
than GCR as was proven in [T7]. There is one advantage of GCR over GMRES, 
however: While the GMRES algorithm has to be modified (slightly) to allow for 
variable preconditioning, GCR adapts itself "automatically" to such a situation. 
For this reason GCR was used in various contexts of variable preconditioners 
like, for example, [10] - In the present paper we consider situations where there 
is an additional need for (inexact) deflation in order to get sufficiently fast 
convergence. While this can be done elegantly and efficiently within the GMRES 
framework, there seems to be no such way for GCR. This explains why we focus 
on GMRES from now on. 

5.2. Flexible GMRES 

Standard (non-variable) right preconditioning of the equation Ax — b means 
that instead of looking for an iterate xj e x + Kj(A,r ) we now look for 
one in the space xo + MKj(AM,r ), where M stands for the preconditioning 
matrix. The idea is that with a good preconditioner M (a good approximation 
to A^ 1 ), the "search space" xo + MKj(AM,ro) contains substantially better 
approximations to the solution than x + Kj(A,ro). In right preconditioned 
GMRES [15] we obtain the j-th iterate 

Xj = .T + MSyj, where <% = a,Tgmin SyeK]{AMro) \\b- A(x + MSy)\\. (5) 

Right preconditioned GMRES uses the Arnoldi process to compute an or- 
thonormal basis of Kj(AM, r°), orthogonalizing AMv^ against all previous vec- 
tors vi , . . . , Vj ■ . The solution of ^ can then be obtained as the solution of a 
small (m + 1) x m least squares problem. We do not discuss this further here, 
since right preconditioned GMRES appears as a special case of flexible GMRES 
[T^] which we develop in all its details now. 

In a flexible context, the preconditioning matrix depends on the iterative 
step, i.e. we have a new preconditioning matrix Mj in each step j. In a "flexi- 
ble" Arnoldi process we thus orthogonalize AMjVj against all previous vectors 
V\ , . . . , Vj ■ It will turn out useful to also keep track of the preconditioned vectors 
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Algorithm 3: Flexible Arnoldi Process 



Input : A G C™ xn , b G C™, an integer m 

Output: V m+1 = [ Vl \ ... \v m+1 ] G C" x ( m+1 \ Z m = ... \z m ] G C" xm , 
ffm = (ft.,) e c m+lxm 

1 p = \\b\\ 

2 vi := 

3 for j = 1, . . . , m do 

4 
5 
6 
7 
8 

9 
10 



MjVj /* preconditioning of new basis vector */ 



"3 ■— lr± 3 u 3 

w := Azj 

for i = 1, . . . ,j do /* orthogon. against previous vectors */ 

hij := vfw 
w = w — hijVi 

hj+ij := H| 

Vj + \ := w/hj+ij /* normalization */ 



Zj = MjVj for future use when computing (flexible) GMRES iterates. So we 
end up with the flexible Arnoldi process described as Algorithm [3] 
The flexible Arnoldi relation 

AZ m = V m+ iH m (6) 

summarizes the computations of Algorithm [3j its j-th column representing 

3 

h 3 + h3 V 3 + l = Az 3 ~ X! 

i=l 

Note that the (to + 1) x m matrix H m is upper Hessenberg, i.e. its elements 
below the first subdiagonal, hij for i > j + 1, arc all 0. The columns Vj of V m +i 
are orthonormal, so we have V£ +1 V m+ i = I G c( m+1 ) x ( m+1 ). 

In the non- flexible case, i.e. Mj — M for all j, the orthonormal vectors 
Vi, . . . , Vj span the Krylov subspace Kj(AM, r<y), and the vectors = Mw,, i = 
1, span the search space MKj(AM, ro). In this case we can formulate (JsJ) 
without Z m to retrieve the standard Arnoldi relation 

(AM)V m = V m+1 H m . (7) 

In the case of a variable preconditioner, there is no immediate notion of a 
(preconditioned) Krylov subspace, but we can still use the space spanned by 
the vectors z,, i.e. iange(Z m ) as a search space. This is the approach taken in 
flexible GMRES (F-GMRES) [TB], where we obtain the iterates x m by imposing 
the minimal residual condition for x m G xq + range (Z m ). So we have to compute 

i] = argmm €G(> ||6 - A(x + Z m £)\\. (8) 
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Using the flexible Arnoldi relation ([6]) and the fact that V m +i has orthonormal 
columns, we have 

\\b-A(x +ZU)\\ = \\r -V m+ iHU\\ = WVm+iPet-Vm+iHmZW = W^-H^l 

So solving the (m + 1) x m least squares problem rj = argmin^ gCm ||/3ei — ifm^H 
gives the coefficient vector rj from which we retrieve the FGMRES iterate x m 
as 

x m = x + Z m r\. 

We note in passing that the (m + 1) x m least squares problem is usually 
solved via a QR- factorization of H m . Due to the upper Hessenberg structure 
of H m , this can easily be done in an iterative manner by updating the QR- 
decomposition of Hj to that of Hj+i using one additional Givens rotation. 
Moreover, it is possible to very cheaply update the norm of the residual of the 
j-th GMRES iterate without explicitly computing the iterate. This can be used 
to implement an early termination criterion in cases where the j-th (F) GMRES 
iterate is already accurate enough for j < m. Details can be found in [15], e.g.. 
See also section 1531 

5.3. Restarts and Deflation 

If larger values of m are required to obtain a sufficiently accurate approxi- 
mation x m , full F-GMRES as described in the previous section is not feasible 
since we have to store the 2m vectors Vj and Zj and the total cost for the or- 
thogonalizations in the flexible Arnoldi process grow like 0(nm 2 ). It is therefore 
common practice to use restarted F-GMRES. The integer m is fixed to a moder- 
ate value. One then goes through several cycles. Each cycle performs m steps of 
the full F-GMRES algorithm, with its initial guess given by the approximation 
to the solution computed in the previous cycle. 

While restarts overcome problems with storage and computational complex- 
ity, they may severely degrade the convergence behaviour, and it might even 
happen that the method stagnates at a large residual norm without achieving 
any further progress. For difficult problems, this phenomenon has been widely 
observed in the literature, and several cures have been proposed. One of the 
most appropriate ones is the deflated restart modification which we discuss now. 

Upon a restart, all the information contained in the search space K, built 
up so far is lost. In particular, in the new cycle we do not keep the residuals 
orthogonal to AJC. The idea of deflated restarts is to extract a low-dimensional 
subspace U of K which contains the most important information in order to 
speed-up convergence, and to at least maintain the residuals orthogonal to AU. 
The new cycle can then be characterized as an augmenting Krylov subspace 
method with 11 as the augmenting space. The crucial point is that, as we will 
see in the next section, it is possible to construct this deflating augmenting 
subspace in such a way that we do not need additional multiplications with A. 



10 



5.4- Augmentation by Deflation 

We first describe the ideas and technicalities of deflated restarts for the non- 



flexible case. The extension to FGMRES will be given in section 5.5 So for now 
we assume we have a fixed preconditioner M in every iteration, and to further 
simplifiy the notation we base our whole discussion on the non-preconditioned 
system Ax = b. The (right) preconditioned case is obtained by merely replacing 
A with AM everywhere. 

The presence of eigenvector components corresponding to small eigenvalues 
in a residual r induces a large contribution of these eigenvectors in the error 
e = A~ 1 r. It is thus desirable to chose the augmenting subspace U as one that 
contains approximations to eigenvectors belonging to small eigenvalues. Those 
can be approximated from the Krylov subspace built up in the previous cycle. 
If this is done in the right manner, one can establish an Arnoldi type process 
for the augmented spaces with little additional cost as compared to standard 
Krylov subspaces, so that one ends up with an efficient method. We now give the 
details, following [18], which in turn builds on [19j |20j [21] . The key ingredient 
is to use harmonic Ritz vectors [2"2"| . 

Definition 1. Given a matrix A £ C™ x " and a subspace JC C C™, a harmonic 
Ritz pair (A, v) with X 6 C, v £ K for A with respect to K. satisfies 

Av-Xv _L AJC. (9) 

Harmonic Ritz pairs are well suited to approximate eigenpairs with a small 
eigenvalue [22] . The following lemma shows how harmonic Ritz pairs can be 
computed in cases where K, is a Krylov subspace or an appropriately augmented 
Krylov subspace. 

Lemma 1. Let K, be a subspace o/C" of dimension m and assume that 

AIC C /C © (s) for some vector s£ C". 

Let the columns v\, . . . , v m +i ofV m +i £ c nx ( m + 1 ) ft e orthonormal vectors such 
that vi, . . . , v m span K, and v\, . . . , u m +i span /Cffi(s), so that there exists a full 
rank matrix H m £ c( m +!) xm with 



Moreover, let 

H m = 

Then: 



h H 



AV m = V m+1 H m . (10) 



with H m £C mxm ,h m £ 



a) The harmonic Ritz pairs of A w.r.t. K, are given as {X,V m g), where (X,g) 
is an eigenpair of the matrix 



H rn + f m hil, where H"f m = h 



■in ■ 
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b ) The residual Au — Xu of a harmonic Ritz pair satisfies 



Au — Xu G (r), 

where r spans the orthogonal complement of AK in K. © (s), i.e. 

AK.®± (r) =JC®(s). 



Proof. Using (10 1, the defining property ^ for the Ritz pair (A, v) with 
v = V m g turns into 

(AV m ) H {Av- Xv) =0 
& (V m+1 H m ) H \V m+1 H m g - XV m g) = 



XI„ 




.9 =0. 



Since H m has full rank, the matrix H m is non-singular. The last equality can 
thus be further reduced to 



{H%H m + h m h*) g - XH^g = 



[Hn 



fmh") g ■ 



Xg = 0, where H"f, 



which shows a). Part b) is trivial, since the residual of any harmonic Ritz pair 
is in AK, + (s) and is orthogonal to AK,. □ 

So, to obtain harmonic Ritz pairs, we first compute f m as the solution of an 
m x m linear system and then compute the eigenpairs of H m + f m hm- 

Note that by ^ the vector r from part b) of Lemma [l] is also the residual of 
the GMRES iterate with respect to the search space K.. Since in the situation 
of Lemma [T] we have 

AUeU + (r) (11) 

for any subspace IA spanned by some harmonic Ritz vectors, we see that the 
augmented Krylov subspaces Kj(A, r,U) := U + Kj(A, r) satisfy 

A ■ (U + Kj(A, r)) C U + K J+1 {A, r). 

This has two major consequences. The first is that, starting from an orthonor- 
mal basis of Ki(A,r,U), we can build nested orthonormal bases for the spaces 
Kj(A, r,Li) in an Arnoldi type manner. The second is that we can compute har- 
monic Ritz vectors with respect to K m (A,r, U) in the way given in Lemma [l] 

This is why we are able to set up a restarted GMRES method with deflation 
of harmonic Ritz vectors, where each cycle consists of the following steps: 

1. Extract k harmonic Ritz pairs from the "search space" U pr + K m (A,r pr ) 
built in the previous cycle; these harmonic Ritz vectors span the current 
augmenting subspace U. 



12 



2. With r denoting the residual of the iterate x at the beginning of the current 
cycle, compute nested orthonormal bases V\,... ,v k +j for Kj(A,r,U), j — 
1, . . . ,to, such that for Vk+j = [v\ \ . . . I^fc+j] we have the Arnoldi type 
relation 



k+j 



Vt 



k+] + \Hk+j 



H, 



k+j 



c (fc+j+l)x(fc+j)^ 



(12) 



3. Compute the current GMRES iterate and its residual using (12) 



Here, as in the sequel, we use the superscript pr to denote quantities from 
the previous cycle; no superscript refers to the current cycle. For an efficient 
implementation, all three steps are coupled by the way in which we construct 
the orthogonal basis for the current search space Kj(A,r, IX). We now discuss 
the three steps in detail. 

Extracting the Ritz pairs. Let VJ~ r , to = fc+TO, denote the matrix containing the 
orthonormal basis vectors of the search space built up in the previous cycle. By 
Lemma [l^,), the k harmonic Ritz vectors ue,£ = 1, . . . , k to augment the current 
Krylov subspace can be computed using H~ r from the Arnoldi-type relation 



( 12 ) of the previous cycle. The harmonic Ritz vectors are then given as 



ut = V£ gi , 9/6(7^ = 1 ft, 

which we summarize as 

U k := [«i| • • . K] = Vg r G k , G k = [ 9l \... \g k }. 



(13) 



Computing the next GMRES iterate and its residual.. Assume that we have 
already built V k +m+li the columns which represent an orthonormal basis of the 
current search space. We have to compute n E C +m such that ||6 — A(x + 
Vk+mV)\\ is minimal. The residual r = b — Ax of the current iterate, which is in 
Ki(A,r,U pr ) and thus in range(T4+i), can be expressed as 



b — Ax = Vk+m+ic-, where c ; 







Taking (12) as granted for the moment, we have 

b - A(x + V m+k r)) = V m+ k+i (c - H m+k r] 

Since the columns of V m +k+i are orthonormal, minimizing b — A(x + V m +kTl) 
is therefore equivalent to solving the small (m + k + 1) x (to + k) least squares 
problem 

T] = argmin c - H m+k r] . 
This gives the next GMRES iterate x next = x + V m+k n with residual 
r next = V m+k+ i(c - H m+k 7]). 



(14) 
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Building the orthonormal basis for the current search space. The crucial part 
here is to obtain an orthonormal basis for K i (A, r, Li) without investing multi- 
plications with A. 

From the columns of being orthonormal, we obtain an orthonormal 
basis v\, . . . , v k of IA by computing an orthonormal basis for range Gk from ( 13 ), 
i.e. by computing the QR-factorization Gk — QkRk, where Qk € C mxk has 
orthonormal columns, and putting 



V k := H 



\Vk_ 



From Q we have AU C hi + (r) , where r is the residual of the iterate at the 

' pr -H~'v pr ) according to (14). 



beginning of the current cycle, r 
Defining G k+1 € c(" l+1 ) x ( fe+1 ) as 

Gk+i 



- v pr 



Gu 



m + l 



ra I 



(15) 



we thus have that range(l / ? |^_ 1 G'/ £ +i) = U + (r). So we can extend vi, . . . ,v k to 
an orthonormal basis v±, . . . , Wfc+i of IA + (r) by extending the QR-factorization 
Gk = QkRk to a QR factorization Gk+i = Qk+iRk+i where 



Qk+i — 

This gives the relation 



Qk * 
* 



^.(fn+i)x(fe+i) 



k+1 



\v k \v k +i] = V r pr 



+iVfc+i- 



(16) 



On the other hand, by ( |ll| ) for j = there exists a matrix iJfc G c(fc+i)xfc 
such that 

A^ fc = Vk+iHk- 



This matrix is actually given as 

H k = Qk +1 H^Q k , 
which can be seen by comparing 

AV k = AV pr Q k = V P l,H pr Qk 

*k r m^ K m+l m^ K 

which follows from the Arnold! relation of the previous cycle and 
AV k = Vk+iHk = VS r Q k+1 H k , 



(17) 
(18) 



which is due to (161 and (17). We have therefore shown (12) for j = 0. 

The Arnoldi-type relation ( 17) for an orthonormal basis of U + K\{A, r) can 
now be extended to an orthonormal basis of U + Kj (A, r) through the standard 
Arnoldi orthogonalization procedure, where in step j we orthogonalize Avk+j 
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against all previous basis vectors. As a consequence, the matrices Hk+j in the 
resulting Arnoldi-type relation ( 12 ) have the form 





H k 


hi,k+i 
hk+i,k+i 


hi t k+j 

hk+i,k+j 


H k +j - 




hk+2,k+i 


hk+2,k+j 




Ojxk 





■ hk+3,k+j 









hk+j+x,k+j 



C (fe+j+l)x(fc+i) 



which has u ppe r Hessenberg structure except for the first k + 1 rows. Herein, 
Hk is from (18); for the other entries hu see Algorithm Q below. 



5.5. Deflating flexible GMRES 

If we use a constant preconditioner M, the deflating procedure described so 
far can be directly extended to right preconditioned GMRES by just changing 
the operator from A to AM. Note that we now compute harmonic Rltz values 
and vectors for AM rather than A. 

As was pointed out in [331 HUi g° m g from constant to varying precondi- 
tioning, is conceptionally almost trivial: For each j the preconditioner Mj oc- 
curs only once in the matrix- vector product Zj — MjVj of the flexible Arnoldi 
process. Since the vectors Vj are linearly independent — they are even mutu- 



ally orthogonal — there exist a constant matrix M for which Mvj 



M jVj 



for 



j = 1, . . . , m. We don't have to know M explicitly, all we need is its action on 
the Vj which is given through the variable preconditioning. So we can perform 
the deflation procedure in exactly the same way as in the case of a constant 
preconditioner. 

Algorithmically, we have to take care of storing the preconditioned vectors 
Zj = MjVj explicitly, as it is done in the flexible Arnoldi process. The resulting 
deflated, flexible restarted GMRES method (FGMRES-DR) is formulated as 
Algorithm [I] 

5. 6. Computing the implicit norm of the residual within GMRES-DR 

As noted in section [53} the QR factorization of H can be updated at each 
step using Givens rotations. The implicit norm of the residual is easily computed 
by applying in sequence the rotations to the right hand side c of the little 
least squares problem and computing the absolute value of the last element 
of d = Q H c. The set of rotations form the unitary matrix Q H such that 
Q^ +1 -ff m +i = R m with R m triangular. The application of to c is easily 

seen as a change of basis. This new basis is composed by a set of orthonormal 
vectors which spans the subspace inverted by the iteration, plus a vector which 
is not inverted. The first m vectors of the new basis V m +\ satisfy AZ m — 
V m R. The m + 1 element of c' is the component of b on the non- inverted, 
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Algorithm 4: F-GMRES-DR 



Input : A E C™ xn , b G C™, restart length m, dimension of augmenting 
(harmonic Ritz) subspace k 

1 choose initial guess Xq, compute r = b — Ax° 

/* first cycle: flexible GMRES, no deflating subspace */ 

2 Perform m steps of the flexible Arnoldi process starting with tq. This 

gives Z m , V m+ i,H m from M. 

3 Put c = Urollex G C m+1 . 

4 Compute rj = argmin^ gCm \\c — H m ^\\ 

5 Obtain GMRES iterate x = x + Z m r\ and residual r = r — V m+ iH m r) 

6 repeat /* all other cycles */ 
/* get augmenting (harmonic Ritz) subspace */ 

Compute all m eigenpairs (Aj,Pi) of H m + / m ^m /* Lemma [l^i) */ 
Identify the k smallest (in modulus) harmonic Ritz values Ai, collect 
the corresponding Ritz vectors gi as columns of Gk G C mxfc . 

/* initialization for augmented Arnoldi process */ 
G k 



10 

n 

12 
13 
14 
15 
16 
17 

18 
19 

20 



Build G k+l = 







c- H m T] 



from (151 and compute its 



QR-factorization G k +i = Qk+iRk+i- 

Update Hk := Q^ +1 H m Qi : , where Qk results from Qk+i by deleting 

its last row and column. 

Update V" fc+ i = V m+1 Q k+1 ,Z k = Z m Q k . 

/* Remaining part of augm. flex. Arn. */ 
for j = k + 1, . . . , m do 
Zj := MjVj 



:= A; 



for i 

h 



1, . . . , j do 



w = w 



= \M\ 

v 3+1 := w/hj+xj 
/* express (old) residual in terms of new basis */ 



Compute c = 



fc+i' 




/* so v = V m+ ic */ 



/* Now obtain iterate and residual for this cycle */ 

Compute rj — argmin^ eC ,„ \\c — H m £\\ 

Update GMRES iterate x = x + Z m r\ and corresponding residual 
r = r - V m+ iH m r) 
23 until convergence 
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normalized vector v m +i- ln GMRES-DR the matrix H is not Hessenberg. It is 
however possible to compute the implicit norm of the residual by computing a 
QR factorization of the non-Hesseberg submatrix of H and then updating the 
QR factorization by Givens rotations. 

5. 7. Mixed precision 

On most of modern CPUs, GPUs as well as on the CELL processor, single 
precision arithmetic is twice as fast as double precision arithmetic. The natural 
way of exploiting the possible acceleration provided by single precision arith- 
metic units, while aiming at a full double precision result when solving a linear 
system, is by using the technique of iterative refinement, see, e.g., [25j . 

Given two different available machine precisions, a high precision p^ and 
a low precision pi, iterative refinement is based on the algorithmic principle 
described in Algorithm [5j 



Algorithm 5: Iterative Refinement 



choose initial guess Xq /* precision p^ */ 

1 compute ro = b — Ax° with precision ph 

2 repeat /* refinement cycles */ 

3 convert ro to precision pi, r <— ro 

4 solve Ax = r up to precision p\ using p\ 

5 convert x to precision p^ and update Xq ^ x + x 

6 compute r = b — Ax° with precision p^; 

7 until convergence 



In practice, where we work with IEEE single (pi = 2~ 24 s» 10 ) and double 
(jOh = 2~ 5 « 1CP 16 ) precision and where we aim at a final norm of the residual 
of the order of 10~ 12 , for example, two cycles are usually sufficient to obtain 
the desired accuracy. However, if the solver used in each refinement cycle is 
deflated flexible GMRES, we may encounter problems if we want to use the 
harmonic Ritz vectors from the last GMRES cycle of the current refinement 
cycle as a deflating subspace for the first GMRES cycle in the next refinement 
cycle: Upon convergence in low precision, the updated, low precision residual 
obtained via FGMRES-DR and the explicitly computed high precision residual 
may differ substantially. Then, including the more precise, explicitly computed 
high precision residual into the next deflated GMRES cycle is not possible, even 
when converted to low precision, because the fundamental relation ( 11 ) between 
the residual and the harmonic Ritz vectors, which is at the heart of the efficient 
use of deflation in FGMRES-DR, is lost. 

To cope with this situation, we developed the following approach: Our re- 
finement cycles correspond one-to-one to the restart cycles of FGMRES-DR, so 
that we have more than just the 2 or 3 usual refinement cycles. Still, of course, 
the overwhelming part of all computation is done in low precision arithmetic. 
We then recompute the residual r — b — Ax of the current iterate x explicitly 
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in high precision and convert it back to low precision. This explicitly computed 
residual r is used in Algorithm [4] to obtain the right hand side c of the least 
squares problem that has to be solved in each cycle of FGMRES-DR (last 4 
lines), and for c — H m rj when obtaining the matrix Gk+i at the beginning of 
the repeat-loop. We use r also when updating the last vector of Vk+i- This last 
vector corresponds to the residual after orthonormalization against the updated 
Vk ■ Our approach can thus be regarded as an attempt to sneak in exact residual 
information via the right hand side of the least squares problem and one vector 
of the subspace while at the same time maintaining the crucial relation (111 
although it will probably not hold exactly after using r to get vt+i- 

This approach works satisfactorily well in practice. Nevertheless, it may 
still happen that after some cycles the updated and the explicitly recomputed 
residual differ by more than a small amount. In this case, we perform a "clean" 
restart of the method, i.e. we start Algorithm [4] from scratch with a first cycle 
that does not use any deflating subspace. As a criterion for triggering the clean 
restart we use ||»" e ||/||7" 4 || > i, with r e being the explictly reco mpu ted residual, 
r l being the implicit residual computed as described in section 5J3 and t being 
a tunable threshold. From our experiments, 2 < t < 10 is a reasonable search 
range. 



6. Numerical results 

We now report results of several numerical experiments which illustrate the 
impact of including deflation into the restarted flexible GMRES method. Our 
target systems are dynamically produced Wilson-Dirac operators with clover 
improvement [26] . We start with the results of a series of experiments for a 
relatively small 16 3 x 32 lattice. We used a thermalized configuration obtained 
from a dynamical simulation at (3 = 5.29 and K sca = 0.135. The value of c aw 
in the clover improvement was 1.91; the resulting critical value for the hopping 
parameter n was k c ss 0.13707. 

Our implementation used 4 nodes of QPACE, i.e. 32 CELL SPU cores in 
total. For the domain decomposition, we divided the lattice into 64 sublattices 
of size 4 3 x 8 each, so that each of the 32 cores is assigned 8 of these sublattices. 
We always performed 8 SAP-cycles for preconditioning, and the solves for the 
subdomains were done approximately via 5 steps of MR, independently of the 
subdomain and of the accuracy of the solution thus obtained. This choice of 
parameters was found to be the best by an extensive numerical study. 

Figure [2] shows the results for flexible GMRES without deflation (left) and 
with deflation (right) for various choices of the cycle length m. Here, as every- 
where, we plot the 2-norm of the residual against the number of matrix-vector 
multiplications invested, i.e. for every "interior" step of the restarted GMRES 
cycle. In these experiments the hopping parameter was set to ft = 0.1368, which 
is still quite far from k c , so that we may consider the system as relatively well 
conditioned. The figures illustrate the fact that a larger value of the cycle length 
m results in fewer iterative steps. Deflation results in fewer iterations when the 
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Figure 2: Results without (left) and with (right) deflation. 16 3 X 32 configuration, k c RJ 
0.13707, k = 0.1368. The legend on the left figure refers to the cycle length m; in the right 
figure the pair (m, k) denotes the cycle length m and the size of the deflating subspace k. 

cycle length is fixed. Note that the arithmetic cost related to the Arnoldi pro- 
cess grows like nm 2 due to the orthogonalizations, so larger cycle lengths m can 
significantly increase the computing time. This being said, Figure [2] shows that 
by investing just 4 vectors for deflation, we get a better performance by using 
a restart value m — 16 than by using a restart value m twice as large and no 
deflation. 

Figure [3] presents similar experiments for a different value of the hopping 
parameter, k = 0.1371, which is very close to the critical value. Now the 
system is much more ill conditioned, and we see that deflation starts to have an 
even more significant impact on the performance of the method. For example, 
reserving k = 4 vectors for deflation while using a restart value of m = 16 reduces 
the required iterations from about 450 to 115. The best deflated method uses 
m = 20 and deflates k = 8 vectors. It requires roughly only half as many 
iterations as the best non-deflated method which requires a much larger cycle 
length, m = 32. In some of the residual plots in the figure to the right we 
observe an increase of the residual around iterations 50-70 for some choices of 
the parameters. At these points we had to do a "clean" restart of FGMRES- 
DR, and the jump in the residual indicates the that the explicitly computed 
residual is considerably larger than the updated one. Since a clean restart 
discards all subspace information acquired so far, the subsequent GMRES cycle 
is comparably as slow as the very first cycle of the iteration. This shows that it 
might be advisable to use quite small values for the subspace dimension fc, since 
in these cases a clean restart was never necessary, thus resulting in the fastest 
overall convergence. 

In a last experiment for this configuration we set k — 0.1374, which is beyond 
the critical value. The SAP iteration by itself is divergent for this choice of 
k (see Figure [5]) , and when used as a preconditioncr for GMRES we observe 
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Figure 3: Results without (left) and with (right) deflation. 16 3 x32 configuration, k c w 0.1371, 
K = 0.1371. 
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Figure 4: Results without (left) and with (right) deflation. 16 3 x32 configuration, k c 0.1371, 
k = 0.1374. 
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Figure 5: Convergence and divergence of the SAP method as a stand-alone solver for various 
values of k 



complete stagnation unless the restart value m is large enough (m = 48). On 
the other hand, deflation with a relatively small dimension of the deflating 
subspace results in quite fast convergence for moderate values of the restart 
value m. This indicates that the divergence of the pure SAP iteration is due to 
the fact that just a few eigenvectors belonging to the corresponding iteration 
matrix are troublesome and that they are effectively removed by deflation. 

The second set of experiments consists of inversions of a state-of-the-art 
48 3 x 64 Nf — 2 lattice configuration thermalized at /3 = 5.40, K soa = 0.1366 
and c sw = 1.8228, while n c rj 0.13670 

In this case a full 256 nodes QPACE rack was used. The machine is config- 
ured as a 3D 4x8x8 torus and we used an SAP block size of 8x2x6x6 
that fits the 8 SPU local stores nicely and satisfies all the lattice and imple- 
mentation constraints. For the test inversions presented, we have chosen to use 
aggressive preconditioner settings that are particularly advantageous on mas- 
sively parallel machines. These settings reduce the overhead of global sums and 
scalar products — global sums being limited by network latency and local scalar 
products by memory bandwidth. In the preconditioner, we therefore performed 
24 SAP cycles with 6 MR iterations for the subdomains. Figure [6] shows that 
for k = 0.13663, using a small dimension for the deflating subspace, i.e., k = 3, 
and keeping the restart length fixed to m — 18, the iteration number decreases 
from 183 to 159, i.e., by about 15%. For the larger value of K = 0.13666, de- 
flating 3 vectors halves the iteration numbers from approximately 500 down 
to 250. Increasing the cycle length does not substantially affect the deflated 
method. Figure [7] shows results for larger values of n. For k = 0.13668 non- 
deflated restarted GMRES almost stagnates even if we use a relatively large 
cycle length, m = 36. On the other hand, deflating 3 or 6 vectors cures this 
stagnation completely, and we can even reduce the cycle length down to 24 or 
even 16. A similar observation holds for k = 0.13670. 
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Figure 6: Results with and without deflation for different solver parameters. 48 3 X 64 config- 
uration, ft = 0.13663 (left), re = 0.13666 (right). 




Figure 7: Results with and without deflation for different solver parameters. 48 3 X 64 config- 
uration, ft = 0.13668 (left), ft = 0.13670 (right). 
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In Table [T] we report the timings on QPACE for all our computations on 
48 3 x 64 configurations. These numbers show that wall clock time follows the 
iteration counts very closely. For the range tested, the cycle length m by itself 
has only a marginal influence. This is due to the fact that our code spends most 
of its time in the 24 SAP iterations of the preconditioner, so that the work for 
the Arnoldi process, which increases with m, is relatively cheap for any of our 
choices for m. 



kappa 


m 


k 


iterations 


time (s) 


0.13663 


18 





183 


39.0 


0.13663 


18 


3 


159 


35.9 


0.13666 


18 





507 


107.1 


0.13666 


18 


3 


255 


56.7 


0.13666 


24 


3 


252 


57.4 


0.13668 


36 





3464 


741.9 


0.13668 


16 


6 


302 


70.4 


0.13668 


18 


6 


326 


74.9 


0.13668 


24 


6 


325 


73.7 


0.13668 


24 


3 


315 


69.9 


0.13670 


48 





2869 


623.6 


0.13670 


48 


4 


409 


92.3 


0.13670 


24 


4 


448 


100.0 


0.13670 


24 


6 


461 


106.1 


0.13670 


18 


6 


441 


102.0 



Table 1: Timings and iterations for the 48 3 X 64 configuration for the tested re and solver 
settings combinations 



7. Conclusions 

While SAP can be implemented efficiently on current supercomputers with 
a deep memory hierarchy and many compute cores like QPACE, it is not neces- 
sarily a convergent iteration, particularly if the hopping parameter k is close to 
the critical value. This can very severely impede the convergence of SAP pre- 
conditioned restarted flexible GMRES or restarted GCR. We have introduced 
deflated restarted flexible GMRES as a (Clover) Wilson fermion solver to cure 
this problem. Numerical results obtained for thermalized configurations of size 
up to 48 3 x 64 show that using a very small dimension for the deflating sub- 
space (never more than 6) always results in satisfactory convergence speeds. An 
additional benefit is that deflation allows to use relatively small cycle lengths 
(in our experiments, m — 16 to 24 was always sufficient), which reduces the 
arithmetic cost of the Arnoldi process and, more importantly, keeps memory 
requirements low. The benefits of deflation are most prominent when the hop- 
ping parameter k is close to the critical value k c or even larger than k c , so that 



23 



deflation is particularly helpful on exceptional configurations arising during an 
HMC simulation. 

The parameter space on which the performance of SAP preconditioned de- 
flated restarted flexible GMRES can be optimized, is very large: the size and 
shape of the subdomains for SAP, the method and number of iterations to use for 
the (approximate) subdomain solves, the number of SAP iterations per precon- 
ditioning step, the cycle length m and the deflating subspace dimension k. We 
could therefore not perform a complete investigation of this parameter subspace. 
Rather, our approach was to fix parameters related to the SAP preconditioner 
by efficiency considerations for our given hardware, and then find good values 
for k and m. Although this quite surely means that we missed the overall best 
method, all our results consistently indicate that it is always worth to include a 
(small) deflating subspace: Convergence is faster for the same cycle length, and 
for hard problems, convergence is enabled at all. We thus think that deflated 
restarted flexible GMRES should establish itself as the standard successor to 
flexible restarted GMRES or GCR and even more to preconditioned BiCGstab 
for which we observed convergence problems relatively often during simulations 
for large configurations. 

We note that an alternative approach to the method presented here are 
multilevel methods like adaptive algebraic multigrid, see [23 [55] or Liischers 
deflated domain decomposition method [35] and its true multilevel variants [3D] . 
As compared to these approaches, the deflated GMRES method docs not require 
a set-up phase which can be relativley costly for the other methods if only one 
or a few right hand sides have to be solved. It has also the advantage of being 
quite simple to implement and to fit modern supercomputer architectures well. 
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